AITopics | hierarchical vision transformer

hierarchical vision transformer

Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks

Neural Information Processing Systems, Aug-19-2025, 17:53:52 GMT

Adapting large-scale pretrained models to various downstream tasks via fine-tuning is a standard method in machine learning.

machine learning, natural language, trainable parameter, (16 more...)

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing

Liu, Yuang, Qiu, Zhiheng, Qin, Xiaokai

arXiv.org Artificial Intelligence, Apr-20-2024

The Transformer has been applied to computer vision on the strength of its performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art results. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in an image is vast and complex, and focusing only on features at the "visual sentence" level is not enough; the features between local patches should also be taken into account. To improve on this, the TNT model further divides the patches into smaller patches, namely "visual words", achieving more accurate results. The core of the Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. To reduce redundancy and improve utilization, we introduce the nested algorithm and apply Nested-TNT to image classification tasks. Experiments confirm that the proposed model achieves better classification performance than ViT and TNT, exceeding them by 2.25% and 1.1% on CIFAR10 and by 2.78% and 0.25% on FLOWERS102, respectively.
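
The two-level split into "visual sentences" and "visual words" described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 8x8 image, and the patch sizes (4 and 2) are assumptions chosen only to keep the example small.

```python
def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def sentences_and_words(image, sentence_size, word_size):
    """Return the visual sentences of an image, each given as its list of visual words."""
    sentences = split_into_patches(image, sentence_size)
    return [split_into_patches(sentence, word_size) for sentence in sentences]


# Toy 8x8 single-channel "image" whose pixel value encodes its position (assumed data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
nested = sentences_and_words(image, sentence_size=4, word_size=2)
print(len(nested), "sentences,", len(nested[0]), "words per sentence")
```

In a TNT-style model each word and each sentence would then be embedded and passed through its own transformer block; the sketch stops at the partitioning step.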

proceedings, transformer, vision transformer, (14 more...)

arXiv:2404.13434

Country:

Asia > Singapore (0.05)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Hierarchical Vision Transformers for Context-Aware Prostate Cancer Grading in Whole Slide Images

Grisi, Clément, Litjens, Geert, van der Laak, Jeroen

arXiv.org Artificial Intelligence, Dec-19-2023

Vision Transformers (ViTs) have ushered in a new era in computer vision, showcasing unparalleled performance in many challenging tasks. However, their practical deployment in computational pathology has largely been constrained by the sheer size of whole slide images (WSIs), which result in lengthy input sequences. Transformers faced a similar limitation when applied to long documents, and Hierarchical Transformers were introduced to circumvent it. Given the analogous challenge with WSIs and their inherent hierarchical structure, Hierarchical Vision Transformers (H-ViTs) emerge as a promising solution in computational pathology. This work delves into the capabilities of H-ViTs, evaluating their efficiency for prostate cancer grading in WSIs. Our results show that they achieve competitive performance against existing state-of-the-art solutions.

hierarchical vision transformer, transformer, vision transformer, (11 more...)

arXiv:2312.12619

Country: Europe > Netherlands (0.04)

Genre:

Research Report > Promising Solution (0.87)
Research Report > New Finding (0.87)

Industry:

Health & Medicine > Therapeutic Area > Urology (0.62)
Health & Medicine > Therapeutic Area > Oncology > Prostate Cancer (0.62)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Predicting Ovarian Cancer Treatment Response in Histopathology using Hierarchical Vision Transformers and Multiple Instance Learning

Breen, Jack, Allen, Katie, Zucker, Kieran, Hall, Geoff, Ravikumar, Nishant, Orsi, Nicolas M.

arXiv.org Artificial Intelligence, Oct-19-2023

For many patients, current ovarian cancer treatments offer limited clinical benefit. For some therapies, it is not possible to predict patients' responses, potentially exposing them to the adverse effects of treatment without any therapeutic benefit. As part of the automated prediction of treatment effectiveness in ovarian cancer using histopathological images (ATEC23) challenge, we evaluated the effectiveness of deep learning to predict whether a course of treatment including the antiangiogenic drug bevacizumab could contribute to remission or prevent disease progression for at least 6 months in a set of 282 histopathology whole slide images (WSIs) from 78 ovarian cancer patients. Our approach used a pretrained Hierarchical Image Pyramid Transformer (HIPT) to extract region-level features and an attention-based multiple instance learning (ABMIL) model to aggregate features and classify whole slides. The optimal HIPT-ABMIL model had an internal balanced accuracy of 60.2% ± 2.9% and an AUC of 0.646 ± 0.033. Histopathology-specific model pretraining was found to be beneficial to classification performance, though hierarchical transformers were not, with a ResNet feature extractor achieving similar performance. Due to the dataset being small and highly heterogeneous, performance was variable across 5-fold cross-validation folds, and there were some extreme differences between validation and test set performance within folds. The model did not generalise well to tissue microarrays, with accuracy worse than random chance. It is not yet clear whether ovarian cancer WSIs contain information that can be used to accurately predict treatment response, with further validation using larger, higher-quality datasets required.
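
The aggregation step mentioned in this abstract, attention-based multiple instance learning, can be sketched in a few lines. This follows the commonly used ABMIL pooling form, where each instance gets a weight softmax(w · tanh(V h_k)); it is an illustration under that assumption, not the authors' code. The feature vectors, `V`, and `w` below are hypothetical toy values that would normally be learned, and the final classifier on top of the pooled embedding is omitted.

```python
import math


def abmil_pool(features, V, w):
    """Attention pooling: weight each instance by softmax(w . tanh(V h_k)), then sum."""
    scores = []
    for h in features:
        hidden = [math.tanh(sum(v * x for v, x in zip(v_row, h))) for v_row in V]
        scores.append(sum(w_i * z_i for w_i, z_i in zip(w, hidden)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, features)) for d in range(dim)]
    return pooled, attention


# Hypothetical toy bag: three region-level feature vectors of dimension 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # illustrative projection matrix (normally learned)
w = [1.0, 1.0]                # illustrative scoring vector (normally learned)
slide_embedding, attention = abmil_pool(features, V, w)
print(attention)
```

The attention weights sum to one, so the slide-level embedding is a convex combination of the region features, and the weights themselves indicate which regions contributed most.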

hierarchical vision transformer, histopathology, ovarian cancer treatment response, (1 more...)

arXiv:2310.12866

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Oncology > Ovarian Cancer (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.85)

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, Malik, Jitendra, Li, Yanghao, Feichtenhofer, Christoph

arXiv.org Artificial Intelligence, Jun-1-2023

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

hiera, hierarchical vision transformer, mae, (13 more...)

arXiv:2306.00989

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Research Papers based on Masked Image Modeling part1 (Computer Vision)

#artificialintelligence, Jun-4-2022, 17:15:25 GMT

Abstract: Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage.
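
The efficiency idea in this abstract, discarding masked tokens before they reach the encoder, can be shown with a small sketch. It only illustrates the token bookkeeping for a plain (serialized) token sequence; the function names, the 16-token sequence, and the 0.75 mask ratio are assumptions and are not taken from the abstract.

```python
import random


def split_visible_masked(num_tokens, mask_ratio, seed=0):
    """Randomly pick which token indices are masked; return (visible, masked) index lists."""
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def encoder_input(tokens, visible):
    """Keep only the visible tokens, so the encoder never processes masked ones."""
    return [tokens[i] for i in visible]


tokens = [f"tok{i}" for i in range(16)]  # stand-in for 16 patch embeddings
visible, masked = split_visible_masked(len(tokens), mask_ratio=0.75)
kept = encoder_input(tokens, visible)
print(len(kept), "of", len(tokens), "tokens reach the encoder")
```

Dropping tokens this way is straightforward when every token is processed independently of its spatial neighbours. Operations that mix neighbouring units, such as windowed attention across mask units, are what the abstract proposes removing so that a hierarchical model can use the same trick.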

masked image modeling, transformer, vision transformer, (9 more...)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Swin Transformer 🚀: Hierarchical Vision Transformer using Shifted Window -- Part I

#artificialintelligence, Feb-13-2022, 07:40:15 GMT

So Facebook AI's team came up with DeiT, a data-efficient transformer that was able to outperform SOTA convolutional networks and ViTs in terms of accuracy/FLOPs trade-off. DeiT was trained on no external data, just ImageNet-1k. But it used distillation and depended on a convolutional network as the teacher, so it was not a completely convolution-free solution. Both DeiT and ViT were designed and tested only for image classification, with the general perception that if a network architecture performs well on image classification, it is expected to do well on other tasks, because "image classification is used as a benchmark for measuring the progress of a technique in the vision domain, any progress here translates to downstream tasks like detection and segmentation". To my knowledge, no other work has used ViT or DeiT as a feature extraction backbone for tasks other than classification.

classification, hierarchical vision transformer, shifted window, (2 more...)

Country: Asia (0.08)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.41)
